Sampling-Based Estimation of the Number of Distinct Values of an Attribute
Abstract
We provide several new sampling-based estimators of the number of distinct values of an attribute in a relation. We compare these new estimators empirically to estimators from the database and statistical literature, using a large number of attribute-value distributions drawn from a variety of real-world databases. This appears to be the first extensive comparison of distinct-value estimators in either the database or statistical literature, and is certainly the first to use highly skewed data of the sort frequently encountered in database applications. Our experiments indicate that a new “hybrid” estimator yields the highest precision on average for a given sampling fraction. This estimator explicitly takes into account the degree of skew in the data and combines a new “smoothed jackknife” estimator with an estimator due to Shlosser. We also investigate how the hybrid estimator behaves as the size of the database is scaled up.
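To make the idea of a sampling-based distinct-value estimator concrete, here is a minimal Python sketch of Shlosser's estimator in the form it is usually stated in the literature: D = d + f1 * sum_i (1-q)^i f_i / sum_i i q (1-q)^(i-1) f_i, where d is the number of distinct values in the sample, f_i is the number of values that occur exactly i times in the sample, and q is the sampling fraction. This is only an illustration of one ingredient named in the abstract. It is not the hybrid or smoothed jackknife estimator, and the function name `shlosser_estimate` and its parameters are assumptions made for this example.

```python
from collections import Counter


def shlosser_estimate(sample, population_size):
    """Estimate the number of distinct values in a relation from a sample.

    Hypothetical helper for illustration: `sample` is a list of sampled
    attribute values, `population_size` is the number of rows in the relation.
    """
    n = len(sample)
    if n == 0:
        raise ValueError("sample must not be empty")
    if population_size < n:
        raise ValueError("population_size must be at least the sample size")

    q = n / population_size                  # sampling fraction
    value_counts = Counter(sample)           # value -> occurrences in sample
    d = len(value_counts)                    # distinct values seen in sample
    # f[i] = number of values that occur exactly i times in the sample
    f = Counter(value_counts.values())

    numerator = sum((1 - q) ** i * f_i for i, f_i in f.items())
    denominator = sum(i * q * (1 - q) ** (i - 1) * f_i for i, f_i in f.items())
    if denominator == 0:
        # Only possible when the whole relation was sampled and no value
        # is a singleton; the sample count is then exact.
        return float(d)
    return d + f.get(1, 0) * numerator / denominator
```

Calling `shlosser_estimate(sample, population_size)` on a uniform random sample returns a floating-point estimate. When the sample is the whole relation, the correction term vanishes and the result equals the exact distinct count.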
Similar resources
Estimation of Flow Zone Indicator Distribution by Using Seismic Data: A Case Study from a Central Iranian Oilfield
Flow unit characterization plays an important role in heterogeneity analysis and reservoir simulation studies. Usually, a correct description of the lateral variations of the reservoir is associated with uncertainties. From this point of view, well data alone do not cover reservoir properties. Because of large well distances, it is difficult to build the model of a heterogeneous reservoir, but ...
Step change point estimation in the multivariate-attribute process variability using artificial neural networks and maximum likelihood estimation
In some statistical process control applications, a combination of correlated variable and attribute quality characteristics represents the quality of the product or the process. In such processes, identifying the time at which an out-of-control state manifests can help quality engineers eliminate the assignable causes through proper corrective actions. In this paper, f...
Estimation of Total Organic Carbon from well logs and seismic sections via neural network and ant colony optimization approach: a case study from the Mansuri oil field, SW Iran
In this paper, 2D seismic data and petrophysical logs of the Pabdeh Formation from four wells of the Mansuri oil field are utilized. The ΔLog R method was used to generate a continuous TOC log from petrophysical data. The TOC values calculated by the ΔLog R method were used for a multi-attribute seismic analysis. In this study, seismic inversion was performed based on a neural network algorithm and the resu...
Triangular Intuitionistic Fuzzy Triple Bonferroni Harmonic Mean Operators and Application to Multi-attribute Group Decision Making
As a special intuitionistic fuzzy set defined on the real number set, the triangular intuitionistic fuzzy number (TIFN) is a fundamental tool for quantifying an ill-known quantity. In order to model the decision maker's overall preference with mandatory requirements, it is necessary to develop some Bonferroni harmonic mean operators for TIFNs which can be used to effectively integrate the informa...
Determining the Sample size for Estimation of the CCC-R Control Chart Parameters Based on Estimation Costs
In today's highly competitive industrial environment, driven by fast technology development, quality practitioners want to detect out-of-control situations and take action as soon as possible whenever necessary. Accordingly, new statistical procedures have been developed continually, both to handle high-yield processes and to find methods of minimizing all quality costs. CCC-r chart, th...
Publication date: 1995